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We study three wave function optimization methods based on energy minimization in a variational 
Monte Carlo framework: the Newton, linear and perturbative methods. In the Newton method, the 
parameter variations are calculated from the energy gradient and Hessian, using a reduced variance 
statistical estimator for the latter. In the linear method, the parameter variations are found by 
diagonalizing a nonsymmetric estimator of the Hamiltonian matrix in the space spanned by the wave 
function and its derivatives with respect to the parameters, making use of a strong zero-variance 
principle. In the less computationally expensive perturbative method, the parameter variations are 
calculated by approximately solving the generalized eigenvalue equation of the linear method by a 
nonorthogonal perturbation theory. These general methods are illustrated here by the optimization 
of wave functions consisting of a Jastrow factor multiplied by an expansion in configuration state 
functions (CSFs) for the C2 molecule, including both valence and core electrons in the calculation. 
The Newton and linear methods are very efficient for the optimization of the Jastrow, CSF and 
orbital parameters. The perturbative method is a good alternative for the optimization of just the 
CSF and orbital parameters. Although the optimization is performed at the variational Monte Carlo 
level, we observe for the C2 molecule studied here, and for other systems we have studied, that as 
more parameters in the trial wave functions are optimized, the diffusion Monte Carlo total energy 
improves monotonically, implying that the nodal hypersurface also improves monotonically. 



I. INTRODUCTION 

Quantum Monte Carlo (QMC) methods (see e.g. 
Refs. 1-3) constitute an alternative to standard ab ini- 
tio methods of quantum chemistry for accurate calcula- 
tions of the electronic structure of atoms, molecules and 
solids. The two most commonly used variants, varia- 
tional Monte Carlo (VMC) and diffusion Monte Carlo 
(DMC), rely on an explicitly correlated trial wave func- 
tion, generally consisting for atoms and molecules of a 
Jastrow factor multiplied by a short expansion in config- 
uration state functions (CSFs), each consisting of a linear 
combination of Slater determinants, a form capable of en- 
compassing most of the electron correlation effects. To 
fully benefit from the considerable flexibility in the form 
of the wave function, it is crucial to be able to efficiently 
optimize the parameters in these wave functions. 

Variance minimization in correlated sampling [4-6] has 
become the most frequently used method in QMC for 
optimizing wave functions because it is far more effi- 
cient than straightforward energy minimization on a fi- 
nite Monte Carlo sample. However, while the method 
works relatively well for the optimization of the Jastrow 
factor, it is much less effective for the optimization of 
the determinantal part of the wave function (though still 
possible [4, 7, 8]). Further, there is some evidence that 
energy-optimized wave functions give on average better 
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expectation values for other observables than variance- 
optimized ones (see, e.g., Refs. 9, 10). As a result, a lot 
of effort has recently been devoted to developing efficient 
methods for the optimization of QMC wave functions by 
energy minimization. On the other hand, it should be 
mentioned that variance-minimized wave functions often 
have a smaller time-step error in DMC. 

We now summarize some of the major approaches that 
have been proposed for energy minimization in VMC. 
The most efficient method to minimize the energy with 
respect to linear parameters, such as the CSF coefficients, 
is to solve the associated generalized eigenvalue equa- 
tion using a non-symmetric estimator of the Hamilto- 
nian matrix [11]. The energy fluctuation potential (EFP) 
method [12-16] is very efficient for optimizing some non- 
linear parameters and has been applied very successfully 
to the optimization of the orbitals [13, 16] and CSF co- 
efficients [15, 16]. It has also been applied to the opti- 
mization of Jastrow factors in periodic solids [14]. The 
perturbative EFP method, a simplification of the EFP 
method, retains the same convergence rate for the opti- 
mization of the orbitals and CSF coefficients while de- 
creasing the computational cost [17]. The stochastic re- 
configuration (SR) method, originally developed for lat- 
tice systems [18], has been applied to the full optimiza- 
tion of atomic and molecular wave functions consisting 
of an antisymmetrized geminal power part multiplied by 
a Jastrow factor [19, 20]. It is related to the perturbative 
EFP method and is simpler but less efficient [16, 17]. The 
Newton method is a conceptually simple and general op- 
timization method but a straightforward implementation 
of it in QMC is rather inefficient [21, 22]. However, an 
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improved version of it, making use of a reduced variance 
estimator of the Hessian matrix [23] , is very efficient for 
the optimization of Jastrow factors. Another modified 
version of the Newton method with an approximate Hes- 
sian, named stochastic reconfiguration with Hessian ac- 
celeration (SRH), has been applied to lattice models [24]. 

In this work, we investigate the three best energy min- 
imization methods for the optimization of the Jastrow, 
CSF and orbital parameters of QMC wave functions: the 
Newton, linear and perturbativc methods. The New- 
ton method has already been applied very successfully 
to the optimization of Jastrow factors by Umrigar and 
Filippi [23] , and in this paper it is also applied to the opti- 
mization of the determinantal part of the wave function. 
The linear method is an extension of the zero-variance 
generalized eigenvalue equation approach of Nightingale 
and Melik-Alaverdian [11] to arbitrary nonlinear param- 
eters: at each step of the iterative procedure, the wave 
function is linearized with respect to the parameters and 
the optimal values of the parameters are found by diago- 
nalizing the Hamiltonian in the space spanned by the cur- 
rent wave function and its derivatives with respect to the 
parameters. This method is briefly presented in Ref. 25. 
The perturbative method coincides with the perturbative 
EFP method of Sccmama and Filippi [17] for the opti- 
mization of the CSF and orbital parameters. Here, we 
put this approach on more general grounds by recasting 
it as a simplification of the linear method where the gen- 
eralized eigenvalue equation is solved approximately by 
a nonorthogonal perturbation theory. The Newton and 
linear methods are very efficient for the optimization of 
the Jastrow, CSF and orbital parameters. The perturba- 
tive method is a good alternative for the optimization of 
just the CSF and orbital parameters. 

The paper is organized as follows. In Sec. II, the 
paramctrization of the trial wave function is presented. 
The energy minimization procedures are discussed in 
Sec. Ill, and their realizations in VMC are discussed in 
Sec IV. Sec. V contains computational details of the cal- 
culations performed on the C2 molecule to test the opti- 
mization methods, and in Sec. VI we present the results. 
Sec. VII contains our conclusions. 

Hartree atomic units (Ha) are used throughout this 
work. 



II. WAVE FUNCTION PARAMETRIZATION 
AND DERIVATIVES 



We begin by describing the form of the wave function 
used, the actual parametrization chosen for the optimiza- 
tion, and the corresponding derivatives of the wave func- 
tion with respect to the parameters. 



A. Form of the wave function 

We use an A^-electron wave function of the usual 
Jastrow-Slater form that is denoted at each iteration of 
the optimization procedure by 

|*o} = J(a )|$o), (1) 

where J(a°) is a Jastrow operator depending on the cur- 
rent parameters a" and |$ ) is a multi-determinantal 
wave function. For notational convenience, we assume 
that the wave function \^o) is always normalized to unity, 
i.e. (^ol^o) = 1- In practice, l^o) can have arbitrary 
normalization. 

The wave function |<l>o} is a linear combination of Vcsf 
orthonormal configuration state functions (CSFs), |CV), 
with current coefficients c°, 

Ncsf 

l*o) = £ c?|Ci>. (2) 
1=1 

Each CSF is a short linear combination of products of 
spin-up and spin-down Slater determinants, |£>£) and 

\ D t), \Ci) = H^ d iM\ D i)\ D i)^ whcrc thc coefficients 
rf/.k are fully determined by the spatial and spin sym- 
metries of thc state considered (see, e.g., Ref. 26). 
The use of CSFs is important to decrease the num- 
ber of coefficients to be optimized and to ensure the 
correct symmetry of the wave function after optimiza- 
tion in the presence of statistical noise. The iVj- 
electron and TVj-electron spin-assigned Slater determi- 
nants are generated from a set of current orthonormal 
orbitals, \Dl) = 4 iT 4 2 t'" 4„ T Tl vac ) and l^k) = 

a L T+1 i a L J+2 i • • • & Li\ y&c )> where a L ( with a =t»i-) 

is the fermionic creation operator for the spatial orbital 
|</>2) in the spin-tr determinant, and |vac) is the vacuum 
state of second quantization. The (occupied and virtual) 
spatial orbitals are written as linear combinations of Nb as 
basis functions \xn) (e.g., Slater or Gaussian functions) 
with current coefficients A^, \<j>l) = Agjx„>. 

The V-electron Jastrow operator, J(a°), is defined 
by its matrix elements in the TV-electron position basis 
|R) = |ri,r 2 , ...,r N ) 

(R|J(a°)|R') = J(a°; R)J(R-R'), (3) 

where J(ot°; R) is the spin-assigned Jastrow factor, a real 
positive function of R which is symmetric under the ex- 
change of two same-spin electrons. Its action on an ar- 
bitrary iV-elcctron state |$) is given by (R|J(o; )|$) = 
J(a°;R)$(R) whcrc $(R) = (R|$). The Jastrow oper- 
ator is Hermitian, J(a: )t = J(a°). We use flexible Jas- 
trow factors consisting of the exponential of the sum of 
electron- nucleus, electron-electron and electron-electron- 
nucleus terms, written as systematic polynomial or Pade 
expansions [27] (see also Refs. 7, 28). 
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B. Wave function parametrization 

We want to optimize the Jastrow parameters cti, the 
CSF coefficients cj and the orbital coefHcients Afe „. Some 
parameters in the Jastrow factor are fixed by impos- 
ing the electron-nucleus and electron-electron cusp con- 
ditions [29] on the wave function; the other Jastrow 
parameters are varied freely. Due to the arbitrariness 
of the overall normalization of the wave function, only 
Nqsf — 1 CSF coefficients need be varied, e.g., the coeffi- 
cient of the first configuration can be kept fixed. The 
situation is more involved for the orbital coefficients 
which are not independent due to the invariance prop- 
erties of determinants under elementary row operations. 
To easily retain only unconstrained, nonrcdundant or- 
bital parameters in the optimization, it is convenient 
to vary the orbital coefficients by performing rotations 
among the (occupied and virtual) orbitals with a uni- 
tary operator parametrized as an exponential of an anti- 
Hcrmitian operator. This parametrization is used in 
multi-configuration self-consistent-field (MCSCF) calcu- 
lations (for a recent and general review of MCSCF the- 
ory, see Ref. 30). More specifically, we use the following 
parametrization of the wave function depending on 
Jastrow parameters a, N^ F — Nqsf — 1 free CSF coef- 
ficients c (a is fixed), and orbital rotation param- 
eters K 



|*(a,c, k)) 



J(a)e*(") ]T aid), 



(4) 



where is the unitary operator that performs rota- 

tions in orbital space (see, e.g., Refs. 31, 32). More elab- 
orate parametrizations of the CSF coefficients, such as a 
unitary parametrization [33], are often used in MCSCF 
theory (see, e.g., Ref. 32), but we have not found any 
decisive advantage to using them for our purpose. 

The rotations in orbital space are generated by the 
anti-Hermitian real singlet orbital excitation opera- 
tor [34] 



k(k) 



k<l 



(5) 



where the sum is over all nonredundant orbital pairs, 
E a = E ki - E ik, and E M = o^ajf + a^ajj. is the singlet 
excitation operator from orbital I to orbital k. In Eq. (4), 
the action of the operator e K ( K ) is to rotate each occupied 
orbital in the Slater determinants as 



l^>=e* W l/ fe >=£(0^?>, 



be written as an exponential of an anti-Hermitian ma- 
trix, the off-diagonal upper triangular part of the anti- 
Hermitian matrix realizing a nonredundant parametriza- 
tion of the unitary matrix. To maintain the orthonor- 
mality of the entire set of orbitals, the operator e K ^ is 
applied to the virtual orbitals as well. For a single Slater 
determinant wave function, the orbitals can be parti- 
tioned into three sets referred to as closed (i.e., doubly 
occupied), open (i.e., singly occupied) and virtual (i.e., 
unoccupied). The nonredundant excitations to consider 
are then: closed — ► open, closed — * virtual and open — > 
virtual. For a multiconfiguration wave function, the or- 
bitals can be partitioned into three sets referred to as 
inactive (i.e., occupied in all determinants), active (i.e., 
occupied in some determinants and unoccupied in the 
others) and secondary (i.e., unoccupied in all determi- 
nants). For a multiconfiguration complete active space 
(CAS) wave function [35], the nonredundant excitations 
are then: inactive — > active, inactive — > secondary and 
active — > secondary. For a single-determinant and multi- 
determinant CAS wave function, the action of the reverse 
excitation from orbital k to I (Eik) in E^ = Eki — Eik is 
always zero. For a general multiconfiguration wave func- 
tion (not CAS), some active — > active excitations must 
also be included. Consequently, the action of the re- 
verse excitation Eik in E^ = Eki — Eik docs not gen- 
erally vanish. Only excitations between orbitals of the 
same spatial symmetry have to be considered. In the 
super configuration interaction approach [36] where the 
orbitals are optimized by adding the single excitations 
of the (multiconfiguration) reference wave function to 
the variational space, pioneered in QMC by Filippi and 
coworkers [16, 17], an alternative linear parametrization 
of the orbital space is chosen, \<f>k) = (l + |</>°), in- 

stead of the unitary parametrization of Eq. (6). In that 
case, the optimized orbitals are not orthonormal. 

In the following, we will collectively refer to the Jas- 
trow, CSF and orbital parameters as p = (a, c, n). The 
wave function of Eq. (1) is thus simply | \I> ) = ^(p )) 
where p° = (a°,c ,K° = 0) are the current parameters 
We will designate by A opt = + N£f F 

total number of parameters to be optimized 



NX the 



C. First-order wave function derivatives 

We now give the expressions for the first-order deriva- 
tives of the wave function |\&(p)) of Eq. (4) with respect 
to the parameters pi at p = p° 



(6) 



l*i> 



a|*(p)) 

dpi 



(7) 



p=p" 



where the sum is over all (occupied and virtual) orbitals, 
and (e K )ki are the elements of the orthogonal matrix e K 
constructed from the real anti-symmetric matrix k with 
elements km- More generally, any unitary matrix can 



which collectively designate the derivatives with respect 
to the Jastrow parameters 



!*«*> = -£-4*0). 



(8) 
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with respect to the CSF parameters 

|* CJ ) = J( a °)\Cr), (9) 

and with respect to the orbital parameters 

\* Kkl ) = J( a °)E^ ). (10) 

The first-order orbital derivatives are thus generated by 
the single excitations of orbitals out of the state |$o). 

D. Second-order wave function derivatives 

The second-order derivatives with respect to the pa- 
rameters pi at p = p°, which are needed only for the 
Newton method, are 



l*«> 



3 2 |tt(p)) 
9pidpj / p . p ,: 



(11) 



which collectively designate the Jastrow-Jastrow deriva- 
tives 



IT V 52j (« ),^V 

l*°«°i)= . l*o), 



dondoij 



the Jastrow-CSF derivatives 



a/(q°) 



the Jastrow-orbital derivatives 



ai(a°) 



the CSF-orbital derivatives 



I*, 



and the orbital-orbital derivatives 



(12) 



(13) 



(14) 



(15) 



(16) 



Notice that the wave function form of Eq. (4) is linear in 
the CSF parameters and therefore the CSF-CSF deriva- 
tives are zero, \^f CICJ ) = 0. The orbital-orbital deriva- 
tives correspond to double excitations of orbitals out of 
the state |3>o)- Since we usually start the optimization 
with reasonably good initial orbitals coming from a stan- 
dard MCSCF calculation we set these second derivatives 
to zero, \^K ki K mn ) = 0, in order to reduce the compu- 
tational cost per iteration during Newton minimization. 
Nevertheless, it takes only a few steps to optimize the 
orbitals as discussed in Sec. VI. 



III. ENERGY MINIMIZATION PROCEDURES 

In this section, we present the three methods inves- 
tigated in this work to minimize the variational energy 
with respect to the wave function parameters p 



E = min.E(p), 
p 



(17) 



where E(p) = (*(p)|fr|*(p))/(*(p)|*(p)) and H = 
T + W ee + V ne is the electronic Hamiltonian, includ- 
ing the kinetic, electron-electron interaction and nuclei- 
electron interaction terms. The Hamiltonian can also in- 
clude a nonlocal pseudopotential, enabling one to avoid 
the explicit treatment of core electrons. The energy cor- 
responding to the current parameters p° will be denoted 
by E = E(p°). 



A. Newton optimization method 

The Newton method was first applied to the opti- 
mization of QMC wave functions by Rappe and cowork- 
ers [21, 22]. It has been considerably improved by Umri- 
gar and Filippi [23] , and by Sorclla [24] , by making use of 
a lower variance statistical estimator of the Hessian ma- 
trix and by employing stabilization techniques. In Ref. 23 
the correct Hessian was used, whereas in Ref. 24 an ap- 
proximate Hessian, which reduces to the exact Hessian 
for parameters that are linear in the exponent, was used. 
We now recall the basic working equations. 

The energy E(p) is expanded to second-order in the 
parameters p around p° 

7Y°p' N° pt N° pt 



i=l j=l 



(18) 



where the sums are over all the parameters to be opti- 
mized, Api = Pi — p® are the components of the vector 
of parameter variations Ap, 



9i = 



V Opi J p=p0 



(19) 



are the components of the energy gradient vector g, and 

32 , 



dPidPj J p= p o ' 



(20) 



are the elements of the energy Hessian matrix h. Im- 
position of the stationarity condition on the expanded 
energy expression, dE^(p)/dpi — 0, gives the following 
standard solution for the parameter variations 



Ap = h 1 • g, 



(21) 



where h 1 is the inverse of the Hessian matrix. In prac- 
tice, the energy gradient and Hessian are calculated in 
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VMC with the statistical estimators given in Sec. IV A, 
yielding the parameter variations Ap of Eq. (21) that 
are used to update the current wave function, \^o) — > 
\^(p° + Ap)). It simply remains to iterate until conver- 
gence. 

Stabilization. As explained in Ref. 23, the stabilization 
of the Newton method is achieved by adding a positive 
constant, a diag > 0, to the diagonal of the Hessian matrix 
h, i.e., hij — > hij + a d ia g <% . As a diag is increased, the pa- 
rameter variations Ap become smaller and rotate from 
the Newtonian direction to the steepest descent direc- 
tion. A good value of adiag is automatically determined 
at each iteration by performing three very short Monte 
Carlo calculations using correlated sampling with wave 
function parameters obtained with three trial values of 
adiag and predicting by parabolic interpolation the value 
of a dia g that minimizes the energy [25] , with some bounds 
imposed. The use of correlated sampling makes it pos- 
sible to calculate energy differences with much smaller 
statistical error than the energies themselves. This pro- 
cedure helps convergence if one is far from the minimum 
or if the statistical noise is large in the Monte Carlo eval- 
uation of the gradient and Hessian. 

We have found that adding in a multiple of the unit 
matrix to the Hessian as described above works well, but 
there exist other possible choices of positive definite ma- 
trices that could be added in. For instance, Sorella [24] 
adds in a multiple of the overlap matrix of the first-order 
derivatives of the wave function. Another possible choice 
is a multiple of the Levenberg-Marquardt approximation 
to the Hessian of the variance of the local energy. 



B. Linear optimization method 

The most straightforward way to energy-optimize lin- 
ear parameters in wave functions, such as the CSF pa- 
rameters, is to diagonalize the Hamiltonian in the vari- 
ational space that they define, leading to a generalized 
eigenvalue equation. This has been done in QMC for ex- 
ample in Rcfs. 11, 37. The linear method that we present 
now is an extension of the approach of Ref. 1 1 to arbitrary 
nonlinear parameters. This method is also presented in 
Ref. 25, using slightly different but equivalent conven- 
tions. 

For notational convenience, we first introduce the nor- 
malized wave function 



I*(p)> = 



I*(P)> 



v/(*(p)I*(p)> 



(22) 



The idea is then to expand this normalized wave func- 
tion |^(p)) to first-order in the parameters p around the 
current parameters p°, 



jyopt 



|*lin(p)> = |*0> + E ^ I**)' 



(23) 



where the wave function at p = p° is simply ^(p )) = 
|*o) = |*o) (chosen to be normalized to 1) and, for i > 1, 
\^i) are the derivatives of |W(p)) that are orthogonal to 
l*o> 



3|*(P)) 

dpi 



|*<> - S 0i |* >, (24) 



where Soi = (^o\^i)- The minimization of the energy 
calculated with this linear wave function 



where 



E lin = min£i in (p), 
p 



(vE lin (p)|g|vl/ lin (p)) 

(*l in (p)|*l in (p)) 



(25) 



(26) 



leads to the stationary condition of the associated La- 
grange function 

V p [(* lin ( P )|i?|¥ lin (p)} - £fe„(*lin(p)|*li»(p))] = 0, 

(27) 

where En n acts as a Lagrange multiplier for the normal- 
ization condition. The Lagrange function is quadratic in 
p and Eq. (27) leads to the following generalized eigen- 
value equation 



H Ap = E Vin S • Ap, 



(28) 



where H is the matrix of the Hamiltonian H in 
the (V opt -(- l)-dimensional basis consisting of the 
current normalized wave function and its derivatives 
{|* >,|*i>,|*2>,--- , |*iv°pt)}, with elements = 
(^i\H^j), S is the overlap matrix of this (A opt + 1)- 
dimensional basis, with elements Sij = (^>i\^j) (note 
that Soo = 1 and Sio = Soi — for i > 1), and Ap is 
the (V opt + l)-dimcnsional vector of parameter variations 
with Apo = 1. The linear method consists of solving the 
generalized eigenvalue equation of Eq. (28), for the lowest 
(physically reasonable) eigenvalue and associated eigen- 
vector denoted by Ap. The overlap and (non-symmetric) 
Hamiltonian matrices are computed in VMC using the 
statistical estimators given in Sec. IV B. Although we 
focus here on the optimization of the ground-state wave 
function, solving Eq. (28) also gives upper bound esti- 
mates of excited state energies of states with the same 
spatial and spin symmetries. 

However, there is an arbitrariness in the previously de- 
scribed procedure: we have found the parameter varia- 
tions Ap from the expansion of the wave function |^(p)) 
of Eq. (22), but another choice of the normalization of 
the wave function will lead to different parameter varia- 
tions. To see that, consider a differently- normalized wave 
function 



|*(p))=JV(p)|*(p)), 



(29) 
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where the normalization function N(p) is chosen to sat- 
isfy N(p°) = 1 so as to leave unchanged the normaliza- 
tion at p = p°, i.e. ^(p )) = |*o>- The derivatives of 
this new wave function are 



( dWp)) \ 

{ d Pl ) 



= |*i>+JVi|*o>, (30) 



where Ni = {dN(p)/dpi) p=p0 , i.e. their projections onto 
the current wave function |\&o) depend on the normaliza- 
tion. Consequently, the first-order expansion of this new 
wave function 

jyopt 

llin(p)) = |*0> + E A ^ 1^*)' ( 31 ) 



leads, after optimization of the energy, to different opti- 
mal parameter variations Ap. As the two wave functions 

l^iin(p)) and |\&iin(p)) he in the same variational space, 
they must be proportional after minimization of the en- 
ergy, which implies that the new optimal parameter vari- 
ations Ap are actually related to the original optimal 
parameter variations Ap by a uniform rcscaling 



Ap = 



Ap 



1 



(32) 



Any choice of normalization docs not necessarily give 
good parameter variations. For the CSF parameters, it 
is obvious that the best choice is the normalization of the 
wave function of Eq. (4) in order to keep the linear de- 
pendence on these parameters, ensuring convergence of 
the linear method in a single step. This is achieved by 
choosing \SS?i) = which gives 



Ni = Sm, for linear parameters. 



(33) 



For the nonlinear Jastrow and orbital parameters, several 
criteria are possible. We have found that a good one is to 
choose the normalization by imposing that, for the vari- 
ation of the nonlinear parameters, each derivative is 
orthogonal to a linear combination of \^q) an d l^iin), i.e. 

(*i|£*o + (1 - 0*iin/||*iin||) = 0, where £ is a constant 
between and 1, resulting in 



Ni = -- 



(1 



OEj A Pj S i 



(i-O. s v . . 

for nonlinear parameters 



^/i+Er fe nlin A^A^ fe 



(34) 



where the sums are only over the nonlinear Jastrow and 
orbital parameters. The simple choice £ = 1 first used by 
Sorclla [18] in the context of the SR method leads in many 
cases to good parameter variations, but in some cases can 
result in parameter variations that are too large. The 
choice £ = making the norm of the linear wave function 

change H^iin — ^oll minimum is safer but in some cases 



can yield parameter variations that are too small. In 

those cases, the choice £ = 1/2, imposing ||^ii n || = ll^oll, 
avoids both too large and too small parameter variations. 
In particular, if Ap = oo, meaning that "fun is orthogo- 
nal to it follows from Eqs. (32) and (34) that Ap is 
zero for £, — but Ap is nonzero and finite for £ = 1/2. 
In practice, all these three choices for £ usually lead to a 
very rapid convergence of the nonlinear parameters. In 
contrast, choosing the original derivatives, i.e. Ni = Sm, 
leads to slowly converging or diverging Jastrow parame- 
ters. 

Stabilization. Similarly to the procedure used for the 
Newton method, we stabilize the linear method by adding 
a positive constant, adiag > 0, to the diagonal of H except 
for the first element, i.e. — > + adiag<%(l — #io)- 
Again, as adiag becomes larger, the parameter variations 
Ap become smaller and rotate toward the steepest de- 
scent direction. The value of adiag is then automatically 
adjusted in the course of the optimization in the same 
way as in the Newton method. Note that if instead we 
were to add adiae to S H then it would be the "level- 



— EFP 

toman has matrix elements ti ; , 

E^r(s _i )i fc <* fc |ffi*o)[(i 



shift" parameter commonly used in diagonalization pro- 
cedures. We prefer to add to H, in part, because it is not 
necessary to compute S • H in order to solve Eq. (28). 

Connection to the EFP method. The generalized 
eigenvalue equation of Eq. (28) can be re-written 

as an eigenvalue equation, H • Ap = £a; n Ap, 

where H = S • H, i.e. with matrix elements 

H'ij = E£r(S _1 )ifc(*fc|A|*j>- This form is 
useful to establish the connection with the EFP 
optimization method for the CSF and orbital parame- 
ters [13, 15, 16]. This latter approach consists of solving 
at each iteration the effective eigenvalue equation, 

H • Ap = £ EFP Ap, where the EFP effective Hamil- 

„ - WHpWij + 

Sio)5oj + <5io(l - Soj)], 
where designates the current wave function 

and its derivatives without the Jastrow factor, i.e. 
^j) = J(a°)|$j), and (fy k \H\lj ) arc just the compo- 
nents of half the gradient of the energy. Hence, in the 
EFP method, only the off-diagonal elements in the first 
column and first row calculated from the components of 

... TT EFP 

the energy gradient are retained m H 

Connection to the Newton and SRH methods. In the lin- 
ear method, the energy expression that is minimized at 
each iteration, Eu n (p), contains all orders in the parame- 
ter variations because of the presence of the denominator 
in Eq. (26), though only the zeroth- and first-order terms 
match those of the expansion of the exact energy E(p). 
In contrast, in the Newton method, the energy expres- 
sion of Eq. (18), E^(p), is truncated at second order 
in Ap but is exact up to this order. Now, if instead of 
solving the generalized eigenvalue equation (28) , one ex- 
pands the energy expression of Eq. (26) to second order 
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in Ap, one recovers the Newton method with an approx- 
imate (symmetric) Hessian h 1 ^ = Hij + Hji — 2EoSij 
corresponding exactly to the SRH method with /? = of 
Ref. 24. The SRH method is much less stable and con- 
verges more slowly than either our linear method or our 
Newton method for the systems studied here. 



with the normalization condition (^ol^iin) = 1 f° r all ^> 
so that, for A = 1, Eq. (38) reduces to Eq. (37): H x=1 = 
H, l^r 1 ) = l*iin>, E^ 1 = E hn , and we partition the 
Hamiltonian H x as follows 



h x = h(°) + \hW. 



(39) 



C. Perturbative optimization method 

The perturbative method discussed next is identical 
to the perturbative EFP approach of Scemama and Fil- 
ippi [17] for the optimization of the CSF and orbital pa- 
rameters, provided that the same choice is made for the 
energy denominators (see below). We give here an al- 
ternate proof without introducing the concept of energy 
fluctuations that in principle extends the method to other 
kinds of parameters as well. 

Instead of calculating the optimal linearized wave func- 
tion l^iin) by diagonalizing the Hamiltonian H in the 
subspace spanned by {Jv^o), l^i), ^2}, • • • , ^Afp*)}; we 
formulate a nonorthogonal perturbation theory for ['J/im)- 
The textbook formulation of perturbation theory starts 
from the Hamiltonian H whose eigenstates we wish to 
compute, and a zeroth order Hamiltonian whose 
eigenstates are known. Instead, here we start with H 
and the states {|*o), |*2), • • ■ , |*jv°pt)}, and de- 
fine a zeroth order operator for which these states 
are right eigenstates. To do this, we introduce {{^il}, 
the dual (biorthonormal) basis of the basis {J'J'j}}, i.e. 
= Sij, given by (see, e.g., Ref. 38) 



jyopt 

£(s" 

3=0 



(35) 



where (S )y are the elements of the inverse of the over- 
lap matrix S, and we introduce the non-Hcrmitian pro- 
jector operator onto this subspace 



^yopt 

E 

i=0 



(36) 



The optimal linearized wave function, minimizing the en- 
ergy [Eq. (25)], satisfies the projected Schrodinger equa- 
tion 



Aff|*li„) = £li„P|*lm), 



(37) 



with the normalization condition (^ol^iin) = 1, ensuring 
that the coefficient of |Wi; n ) on |^ ) = l^o) is 1 as in 
Eq. (23). 

To construct the perturbation theory, we now intro- 
duce a fictitious projected Schrodinger equation depend- 
ing on a coupling constant A 



PPf*|<>=P* n P|<) 



(38) 



In this expression, is a zeroth-order non-Hermitian 
operator 



jyOpt 



(40) 



i=0 



where fij are arbitrary energies. Clearly, admits 
\^i) as right-eigenstate and as left-eigenstate, with 
common eigenvalue The non-Hermitian perturba- 
tion operator is obviously defined as iP 1 ) — H — H^°\ 



We expand |^i in ) and E^ n in powers of A: |*ii n ) 



Er=o^l*^> and E L = Er=o^< ; - The^eroth- 
order (right) eigenstate and energy are simply: l^un) = 
l^o) and E^ = So- The first-order correction to the 
wave function is determined by the equation 

P (£«» - fib) 1*2) = -P {H (1) E^) |* >. (41) 

To solve this equation, we define the non-Hermitian pro- 
jector operator R — Ei=i which, in compari- 

son to the projector P, also removes the component par- 
allel to l^o)- Note that RP = R, R commutes with 

- £ Q and R\¥£) = |*^> (since (* |*L> = 1 and 
(*o|*o> = 1, implying (^ol^hn) = 0), so that applying 
R on Eq. (41) leads to 



(k) 



i*S> 



R 



(m-E<£A\* Q ) 

H(o) - £ V lm / ' 



Si — So 

^ ( s-\,mm m , ,42, 



i=l 

N apt N opt 



1=1 j = l 



where So and E^ in the numerator and the term 7=0 

J lin 

have been dropped since, for i / 0, (^il^o) = and 

(S )io = 0, respectively. Therefore, the parameter vari- 
ations in this first-order perturbation theory are 



N apt 



(43) 



where H j0 = <*j|H|* > = {*j\H - P |*o) = is 



(*j\H - Eo\* 
just half the gradient of the energy and ASi = Si 



S . 
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The perturbative method consists of calculating the pa- 
rameter variations Ap' 1 - 1 according to Eq. (43), updat- 
ing the current wave function, |^/ } — ► ^(p + Ap^)), 
and iterating until convergence. It is apparent from 
Eq. (43) that the perturbative method can be viewed 
as the Newton method with an approximate Hessian, 
h^j rt — (S )ij/A£i, as also noted in Ref. 17. 

The energy denominators A£i in Eq. (43) remain to be 
chosen. Since perturbation theory works best when 
is "close" to H we choose to have the same diagonal 
elements as H, resulting in 



/(R), by </(R)) = (l/M)E^i/(Rfe) where the M 
electron configurations Rfc are sampled from ^ (R) 2 - 



A£ t 



msm- Eo =Bi-H 00 . (44) 



In practice, only rough estimates of the A£j's are neces- 
sary for the optimization so that one can compute them 
for just the initial iteration and keep them fixed for the 
following iterations. Therefore, for these iterations, only 
the inverse overlap matrix, S , and the gradient of the 
energy, gj = 2 Hj , need to be calculated in the pertur- 
bative' method, leading to an important computational 
speedup per iteration in comparison to the linear method. 

Stabilization. Similarly to the linear method, the per- 
turbative method can be stabilized by adding an ad- 
justable positive constant, adiag > 0, to the energy de- 
nominators, i.e. A£i — > A£i +adi ag , which has the effect 
of decreasing the parameter variations Ap^. 

Connection to the perturbative EFP and SR methods. 
For the CSF and orbital parameters, if the energy de- 
nominators are chosen to be A£j = — 

($o|-ff|3>o)/($o|$o) (i-e., without the Jastrow factor), 
Eq. (43) exactly reduces to the perturbative EFP 
method [17]. Also, Eq. (43) reduces to the SR optimiza- 
tion method [18-20] if the energy denominators are all 
chosen equal, A£j = A£ for all i. 



IV. VARIATIONAL MONTE CARLO 
REALIZATION 

When the previously-described energy minimization 
procedures are implemented in VMC it is important to 
pay attention to the statistical fluctuations. Expressions 
that arc equivalent in the limit of an infinite Monte Carlo 
sample, can in fact have very different statistical errors 
for a finite sample. We provide prescriptions for low vari- 
ance estimators in this section. 

We also note that, in order to reduce round-off noise, 
it can help to rescale the elements of the gradient vector, 
and the hessian, Hamiltonian and overlap matrices using 
the square root of the diagonal of overlap matrix. 

At each step of the optimization, the quantum- 
mechanical averages are computed by sampling the prob- 
ability density of the current wave function * (R) 2 - We 
will denote the statistical average of a local quantity, 



A. Energy gradient and Hessian 

In terms of the derivatives "J/j (R) of the wave function 
of Eq. (4), and using the Hermiticity of the Hamiltonian 
H, an estimator of the energy gradient is [39] 



ft - 2 



gi(R) 
*o(R) 



£l(R) 



tti(R) 
*o(R) 



(£l(R)> 



(45) 



where E L (R) = [i?(R)* (R)] /*o(R) is the local en- 
ergy. In the limit that 'I'o(R) is an exact eigenfunction, 
the local energy becomes constant, _Z?l(R) = E cxact for 
all R, and thus the gradient of Eq. (45) vanishes with 
zero variance. This leads to the following zero- variance 
principle for the Newton and perturbative methods: in 
the limit that ^o(R) is an exact eigenfunction, the pa- 
rameter variations of Eqs. (21) and (43) vanish with zero 
variance. 

Taking the derivative of Eq. (45) leads to the straight- 
forward estimator of the energy Hessian of Lin, Zhang 
and Rappe (LZR) [21] 



h 



LZR. 



n ■ 



(46) 



where 



£l(R) 



*ij(R) 
*o(R) 
gi(R) tt 3 -(R) 
*o(R) *o(R) 

*i(R) tt 3 -(R) 
* (R) *o(R) 



ttij(R) 
*o(R) 



<Sl(R)) 



E L (R) 
(^l(R)) 



(47) 



involving the second derivatives (R) of the wave func- 
tion, 



B, 



^(R) ^-(R) 
*o(R) *o(R) 

^(R) tyj (R) 
*o(R) *o(R) 
*i(R) 



^l(R)^ 
(^l(R)) 




(£ L (R) - <£ L (R)}) 



(48) 
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and 



Cij — 



*o(R) 



(49) 



where E^j (R 



[ J ff(R)^(R)]/* (R) - 
[*j(R)/*o(R)] E L (R) is the derivative of the local 
energy with respect to parameter j. In this estimator of 
the Hessian, the term that fluctuates the most is dj . 

Umrigar and Filippi [23] observed that the fluctuations 
of a covariance (ab) — (a) (b) are much smaller than those 
of (ab) if the fluctuations of a are much smaller than the 
average of a, i.e., a/ (a 2 ) — (a) 2 <C \(a}\, and, a is not 
strongly correlated with 1/6. In Eq. (49), *i(R)/* (R) 
is always of the same sign for parameters in the exponent 
and in practice its fluctuations are much smaller than 
its average. Further, it follows from the Hermiticity of 
the Hamiltonian that (i?L.j(R)} vanishes in the limit of 
an infinite sample [21]. Using these two observations, 
Umrigar and Filippi [23] provided an estimator of the 
Hessian, 



UF 



A. 



B, 



D 



(50) 



that fluctuates much less than the straightforward LZR 
estimator, where the symmetrized estimator 



A, - 



tti(R) 
*o(R) 



*o(R) 



£l,;(R) 



tti(R) 
*o(R) 

*o(R) 



(^lj(R)> 



(51) 



has the same average as the term C\j in the limit of an 
infinite sample, but being a covariance has much smaller 
fluctuations. We note that Aij is already a covariance 
and Bij is a tri-covariance. 

Although the and B^ terms vanish with zero vari- 
ance in the limit that ^o(R) is an exact eigenfunction 
(the Dij term does not), in practice for the Jastrow 
parameters, far from the minimum, the B^ fluctuates 
more than the D^ term for the Jastrow parameters in 
the Hessian of Eq. (50). With the form of the Jastrow 
factors that we use, we have observed that the ratio 
(B^ + Dij) /D^ is roughly independent of i and j for 
most i and j though it changes during the Monte Carlo 
iterations. It is typically between 1.2 and 2.5 at the initial 
iteration and between 0.9 and 1.1 at the final iteration. 
We exploit this to decrease the fluctuations by defining 
a new, approximate Hessian partially averaged over the 
Jastrow parameters 



hfi 3 = A ij + 



((\B l3 + D t3 \)) 



Di 



(52) 



where TU are the initials of the present authors, and the 
average over the Jastrow parameter pairs are defined by 

((Xij)) = (2/NZ(NZ + 1)) Efi? £j? Xij- The av- 
erage is calculated as ((\Bij + Dij\)) / ((\Dij\)) and not 



as (((\Bij + Dij\)/\Dij\)) to avoid possible numerical di- 
vergences of this ratio for small D^. In Eq. (52), i and 
j refer only to Jastrow parameters. For all the terms 
related to the other parameters (including all the mixed 
terms), the Hessian of Eq. (50) is used without further 
modification. 

Exact or approximate wave functions such as 'I'o(R) 
go linearly to zero with the distance d between R and 
their nodal hypersurface, i.e., 5 , o(R) ~ d for d — > 0. 
The local energy £l(R) generally diverges as 1/d for 
d — > for approximate wave functions. In contrast 
to the case of the Jastrow parameters, the derivatives 

(R) for the CSF and orbital parameters have a dif- 
ferent nodal hypersurface than ^(R) and the ratio 
4 r i(R)/^ r o(R) thus also diverges as 1/d, even if the 
wave function v I'o(R) is exact. Consequently, the deriva- 
tive of the local energy i?L,»(R) generally diverges as 
1/d 2 for approximate wave functions. In the expres- 
sion of the Hessian, the leading divergence at the nodes 
of the approximate wave function ^o(R) thus comes 
from the terms (*i(R)/* (R)) (*j(R)/*o(R)) E L (R), 
(*i(R)/*o(R))£?Lj(R) and (* J (R)/* (R))^L,i(R) 
that behave as 1/d 3 . It is however easy to check that 
these third-order divergences cancel exactly in Eq. (50). 



B. Overlap and Hamiltonian matrices 

The elements of the symmetric overlap matrix S are 
S 00 = 1, (53a) 

and, for i,j > 0, 

S l0 = S 0j = 0, (53b) 

and 



*o(R) *o(R) / \ *o(R) / \ *o(R 

The elements of the Hamiltonian matrix H are 

H 0Q = (E h (K)), (54a) 

and, for i, j > 0, 
^o = /l4Sk(R)W^W)>, (54b) 



*o(R) 



*o(R) 



H. 



*o(R) 



e l (h) 



^(R) 
*o(R) 



(^l(R)) 



(54c) 



which are two estimators of half of the energy gradient, 
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and 



^(R) ^-(R) 
*o(R) *o(R) 

*o(R) 
*o(R) 



*o(R)/ \*o(R) 



*o(R) 
^i(R) 
*o(R) 



*o(R) 



^Lj(R) 



^l(R) 
^l(R) 

<£l(R)} 

^(R) 
*o(R) 



<^lj(R)> 



(54d) 



We do not use the Hermiticity of the Hamiltonian H to 
symmetrize the matrix H. In fact, as shown by Nightin- 
gale and Melik-Alaverdian [11], using the non-symmetric 
matrix H of Eqs. (54) leads to a stronger zero-variance 
principle than the one previously-described for the New- 
ton and perturbative methods: in the limit that the 
states {|^o)> 1^2)) • " " J^a^}} span an invariant 
subspace of the Hamiltonian H, i.e. in the limit that the 
linear wave function \I>ii n (R) of Eq. (23) after optimiza- 
tion is an exact eigenfunction, the matrix S • H and 
consequently the eigenvector solution Ap have zero vari- 
ance. In practice, even if we do not work in an invariant 
subspace of H, using the non-symmetric matrix H leads 
to smaller statistical errors on a finite sample than us- 
ing its symmetrized analog. Although in principle diag- 
onalization of a non-symmetric matrix leads to complex 
eigenvalues, in practice the physically reasonable (i.e., 
with large overlap with the current wave function) low- 
est eigenvectors have usually real eigenvalues. Of course, 
in the limit of an infinite sample M —* 00 a symmetric 
matrix H is recovered. 

As noted in the previous subsec- 
tion for the Hessian, although the terms 



(*i (R)/*o (R) ) (*j (R)/*o (R) ) E h (R) 



and 



(*i(R)/*o(R))^Lj(R) in the expression of the 
Hamiltonian matrix of Eq. (54d) display a third-order 
divergence 1/d 3 as the distance d between R and the 
nodal hypersurface of 5 , o(R) g° cs to zero, again these 
divergences cancel exactly. 



C. Comparison of computational cost per iteration 

At each optimization iteration, besides the calculation 
the current wave function ^(R) and the local energy 
_Z?l(R), the Newton method requires the computation 
of the first-order and second-order wave function deriva- 
tives, ^i(R) and ^ (R), and the first-order derivatives 
of the local energy £x, j(R). The linear method requires 
the calculation of *,(R) and £?L,j(R) but not of the 



second-order derivatives of the wave function with re- 
spect to the parameters. In principle, this decreases the 
computational cost per iteration, especially if the many 
orbital-orbital second-order derivatives were to be com- 
puted in the Newton method. In practice, since our 
implementation of the Newton method neglects these 
orbital-orbital derivatives, the computational cost per it- 
eration of the Newton and linear methods is very similar. 

The perturbative method requires the computation of 
the same quantities as the linear method. However, since 
the method is not very sensitive to having accurate en- 
ergy denominators A£j in Eq. (43), and since the energy 
denominators do not undergo large changes from iter- 
ation to iteration we compute these for the first itera- 
tion only. Hence it is not necessary to compute i?L,i(R) 
for subsequent iterations. This leads to a computa- 
tional speedup per iteration in comparison with the linear 
method. The precise speedup factor depends on the wave 
function used; typically, for the systems studied here, we 
have found factors ranging from about 1.5 for a single- 
determinant wave function to 5.5 for the largest multi- 
determinant wave function considered, for the iterations 
for which the Afj's are not computed. 



V. COMPUTATIONAL DETAILS 

We illustrate the optimization methods by calculating 
the ground-state electronic energy of the all-electron C2 
molecule at the experimental equilibrium interatomic dis- 
tance of 2.3481 Bohr [40]. The ground-state wave func- 
tion is of symmetry in the point group -Dqo/i- The es- 
timated exact, infinite nuclear mass, nonrelativistic elec- 
tronic energy is —75.9265(8) Hartree [41], where the num- 
ber in parentheses is an estimate of the uncertainty in the 
last digit. This system has a strong multiconfiguration 
character due to the energetic near-degeneracy of the va- 
lence orbitals, making it a challenging system despite its 
small size. 

We start by generating a standard ab initio 
wave function using the quantum chemistry program 
GAMESS [42], typically a restricted Hartree- Fock (RHF) 
wave function or a MCSCF wave function, using the 
symmetry point group D^h which is the largest sub- 
group of Dooh available in GAMESS. i We use the uncon- 
tracted Slater basis set form of Clementi and Roetti [43] , 
with exponents reoptimized at the RHF level by Koga 
et al. [44]. For carbon, the basis set contains two Is, 
three 2s, one 3s and four 2p Slater functions, that 
are each approximated by a fit to six Gaussian func- 
tions [45, 46] in GAMESS. Specifically, we consider the 
following ab initio wave functions: a RHF wave func- 
tion, with orbital occupations \u^ g \a^a 2 g 2u\\^\ x \-K\ y \ 
a CAS (8,5) wave function, containing 6 CSFs in D±h sym- 
metry made of 7 Slater determinants generated by dis- 
tributing the 8 valence electrons over the 5 active valence 
orbitals 2o~ g 2o~ u l-K u ^ x l-K UyV ?>o~ g ] a CAS(8,7) wave func- 
tion, containing 80 CSFs made of 165 determinants with 
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the 7 active orbitals 2<7 ff 2<7 u l7r UlX l7r UlK 3<7 ff l7r ffi a;l7r fflI ,; 
a CAS(8,8) wave function, containing 264 CSFs 
made of 660 determinants with the 8 active orbitals 
2a g 2cT u l-K u ^ x l-K u ^yZag\iTg^ x lTTg y yZa Ul i.e. all the valence 
orbitals originating from the n — 2 shell of the C atoms. 
In addition, we construct a larger one-electron basis set 
by adding to the basis of Koga et al. one d function with 
an exponent of 2.13 optimized in RHF and we consider 
a wave function obtained from a restricted active space 
(RAS) calculation in this basis that would correspond to 
a CAS(8,26) calculation, using all the 26 orbitals origi- 
nating from the n = 2 and n ~ 3 shells of the C atoms, 
but where only single (S), double (D), triple (T) and 
quadruple (Q) excitations are allowed in the active space. 
This wave function, that we denote by RAS-SDTQ(8,26), 
contains 110481 CSFs made of 411225 determinants. 



The standard ab initio wave function is then multi- 
plied by a Jastrow factor, imposing the electron-electron 
cusp conditions, but with essentially all other free pa- 
rameters chosen to be to form our starting trial wave 
function. QMC calculations are performed with the pro- 
gram CHAMP [47], using this time the true Slater basis 
set rather than its Gaussian expansion. In comparison to 
GAMESS, additional symmetries outside the point group 
D^h are detected numerically which allows one to reduce 
the numbers of CSFs to 5, 50 and 165 for the CAS(8,5), 
CAS(8,7) and CAS(8,8) wave functions, respectively. For 
the large RAS-SDTQ(8,26) wave function, only a fraction 
of all the CSFs are retained in QMC by applying a vari- 
able cutoff on the CSF coefficients and an extrapolation 
procedure is used to estimate the QMC result if all the 
CSFs had been included (see Sec. VIE). For the orbital 
optimization, only the single excitations between orbitals 
of the same irreducible representation of D^h are gener- 
ated. We, however, impose no restriction inside each of 
the two-dimensional irreducible representations 7r„ and 
■Kg. Although one can in principle identify the tt x and 
TT y components and forbid excitations between these two 
components to further reduce the number of free param- 
eters, these redundancies appear to cause no problem in 
practice during the optimization. Also, we impose the 
electron-nucleus cusp condition on each orbital. The pa- 
rameters of the trial wave function are optimized by the 
previously-described energy minimization procedures in 
VMC, using a very efficient accelerated Metropolis algo- 
rithm [48, 49], allowing us to simultaneously make large 
Monte Carlo moves in configuration space and have a 
high acceptance probability. Once a trial wave func- 
tion has been optimized, we perform a DMC calculation 
within the fixed-node and the short-time approximations 
(see, e.g., Refs. 50-53). We use an imaginary time step 
of r = 0.01 H _1 in an efficient DMC algorithm featuring 
very small time-step errors [54], so that the accuracy is 
essentially limited by the quality of the nodal hypersur- 
face of the trial wave function. 
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FIG. 1: Convergence of the VMC total energy -Evmc of the 
all-electron C2 molecule during the optimization of the 24 
Jastrow parameters in a wave function composed of the RHF 
Slater determinant multiplied by a Jastrow factor. The linear, 
perturbative and Newton energy minimization methods are 
compared. For the Newton method, the results obtained with 
the UF Hessian of Eq. (50) and the TU Hessian of Eq. (52) are 
shown. The statistical error on the energy at each iteration is 
0.5 mHa. The insert is an enlargement of the last 6 iterations. 



VI. RESULTS AND DISCUSSION 
A. Optimization of the Jastrow factor 

We first study the convergence behavior of the energy 
minimization methods for the separate optimization of 
the Jastrow, CSF and orbital parameters. To facilitate 
comparisons, we apply the VMC optimization procedures 
with a common fixed statistical error of the energy at 
each step, namely 0.5 mHa. This is not the usual way 
in which wc routinely perform optimizations which is de- 
scribed later in Sec. VI D. 

Figure 1 shows the convergence of the total VMC en- 
ergy during the optimization of the 24 Jastrow param- 
eters in a wave function composed of the RHF Slater 
determinant multiplied by a Jastrow factor. The linear, 
perturbative and Newton methods are compared. For 
the Newton method, we present the results obtained with 
the UF Hessian of Eq. (50), already used in Ref. 23, and 
with the TU Hessian of Eq. (52). To compare the fluc- 
tuations of these two Hessians, we have computed the 



quantity ry = l/ N °v\N°& + 1) V^ EjLi MM) 2 
where (<r(hij)) 2 is the variance of the element hij of 
the Hessian averaged over 100 Monte Carlo configura- 
tions. For the initial iteration of the optimization, far 
from the energy minimum, the UF Hessian fluctuates 
more than the TU Hessian by a factor ry UF /7y TU = 3.6. 
For comparison, the LZR Hessian of Eq. (46) fluctuates 
more than the TU Hessian by a factor jj LZR /jy TU — 150, 
more than two orders of magnitude larger even for this 
modest system. Near the energy minimum, the factors 
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FIG. 2: Convergence of the VMC total energy Bvmc of the 
all-electron C2 molecule during the optimization of the 49 
CSF parameters in a wave function composed of a CAS(8,7) 
part multiplied by a previously-optimized Jastrow factor. The 
linear, perturbative and Newton [with the UF Hessian of 
Eq. (50)] energy minimization methods are compared. The 
statistical error on the energy at each iteration is 0.5 mHa. 



FIG. 3: Convergence of the VMC total energy Bvmc of the 
all-electron C2 molecule during the optimization of the 44 
orbital parameters in a wave function composed of a single 
Slater determinant multiplied by a previously-optimized Jas- 
trow factor. The linear, perturbative and Newton [with the 
UF Hessian of Eq. (50)] energy minimization methods are 
compared. The statistical error on the energy at each itera- 
tion is 0.5 mHa. 



B. Optimization of the CSF coefficients 



are ?7 UF /r; TU = 3.3 and ^LZR/^tu = 600 Thege fac _ 
tors tend to increase with the system size. The Newton 
method with the UF Hessian converges reasonably fast 
in about 6 iterations, which is a little faster than the 
convergence shown in Figs. 1,2 and 4 of Ref. 23 due to 
the previously-described correlated sampling adjustment 
of the stabilizing constant cidiag in the course of the op- 
timization and despite the fact that we are performing 
an all-electron rather than a pseudopotential calculation 
here [55]. The Newton method with the TU Hessian dis- 
plays an even faster convergence, the energy being essen- 
tially converged within the statistical error at iteration 3 
or 4. The linear method has a similar convergence rate to 
the Newton method with the TU Hessian. The Newton 
method with the TU Hessian and the linear method are 
both stable even without stabilization if sufficiently large 
Monte Carlo samples are used. When stabilization is 
employed, the stabilization constant etdiag remains small 
during the optimization, in this example from 1CU 3 for 
the initial iteration to 1CU 7 for the last iterations which 
is 2 or 3 orders of magnitude smaller than the values of 
(idiag in the Newton method with the UF Hessian. The 
perturbative method, in contrast, converges very slowly. 
In fact, it turns out that the energy denominators for 
the Jastrow parameters, A£ a; , calculated according to 
Eq. (44), are all of order unity and adi ag needs to be 
increased to as much as 10 2 to retain stability. In this 
case, the perturbative method essentially reduces to the 
inefficient SR optimization technique. 



Figure 2 shows the convergence of the total VMC en- 
ergy during the optimization of the 49 CSF parameters in 
a wave function composed of a CAS(8,7) determinantal 
part multiplied by a previously-optimized Jastrow factor, 
using the linear, perturbative and Newton [with the UF 
Hessian of Eq. (50)] methods. The linear method con- 
verges in 1 iteration, as it must, and does not require 
any stabilization. When stabilization is used, a^iag re- 
mains as low as 10~ 6 to 10~ 8 during the whole optimiza- 
tion. The Newton and perturbative methods converge 
in 2 or 3 iterations, and are not as intrinsically stable, 
Qdiag being a few orders of magnitude larger for the New- 
ton method and several orders of magnitude larger for 
the perturbative method. The energy denominators for 
the CSF parameters in the perturbative method, A£ CI , 
calculated according to Eq. (44), span only one order of 
magnitude. 

C. Optimization of the orbitals 

Figure 3 shows the convergence of the total VMC en- 
ergy during the optimization of all the 44 orbital param- 
eters in a wave function composed of a single Slater de- 
terminant multiplied by a previously-optimized Jastrow 
factor, using the linear, perturbative and Newton [with 
the UF Hessian of Eq. (50)] methods. The three methods 
display very similar convergence rates, the energy being 
converged within the statistical error in 1 iteration using 
any of the three methods. In this example, the linear and 
perturbative methods converged even without stabiliza- 
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FIG. 4: Total VMC energy E VM c of the all-electron C 2 
molecule with respect to the number of virtual orbitals in- 
cluded in the optimization of the orbital parameters in a 
wave function composed of a single Slater determinant mul- 
tiplied by a previously-optimized Jastrow factor, using RHF 
and RKS B3LYP starting orbitals. The orbitals are ordered 
according to their energies. The statistical error on the energy 
is 0.5 mHa. 



tion whereas the Newton method required stabilization. 
The energy denominators for the orbital parameters in 
the perturbative method, ASm, calculated according to 
Eq. (44), typically span two orders of magnitude from 1 
to 100. 

In the previous orbital optimization, we have consid- 
ered a full optimization of all the orbital parameters, i.e. 
all the allowed excitations from the 6 closed occupied 
orbitals to the 30 virtual orbitals were included in the 
calculation. One may also consider a partial orbital opti- 
mization by restricting the excitations to the lowest sev- 
eral virtual orbitals, as also proposed within the EFP or 
perturbative EFP approaches [56]. This allows one to 
reduce the computational effort and also to decrease the 
statistical noise in the calculation since it is the excita- 
tions to the highest-lying virtual orbitals that modify the 
most the nodal structure of the wave function, leading 
to large fluctuations of the ratio ^(RJ/vpr^R). Fig. 4 
shows the total VMC energy with respect to the number 
of virtual orbitals included in the optimization for a wave 
function composed of a single Slater determinant multi- 
plied by a previously-optimized Jastrow factor. Two sets 
of starting orbitals are compared: orbitals obtained from 
a RHF calculation and orbitals obtained from a restricted 
Kohn-Sham (RKS) calculation with the hybrid exchange- 
correlation functional B3LYP [57, 58], using the ordering 
given by the orbital energies. In both cases, as expected, 
the energy decreases monotonically within the statistical 
error as the number of virtual orbitals included in the 
optimization increases. However the slope of the energy 
does not change monotonically and it is necessary to in- 
clude almost all the orbitals to get close to the optimal 
energy. From Fig. 4 we see that for the C2 molecule the 



B3LYP orbitals provide a better starting point than the 
RHF orbitals. In our experience, this is often but not 
always the case. It is possible that the selection of the 
virtual orbitals adopted here, based on the orbital energy 
ordering, may not be the best choice and other selections 
based on symmetry or chemical intuition could lead to a 
more rapid convergence. 

Note that Fig. 4 was obtained by just optimizing the 
orbital parameters for a fixed, previously-optimized Jas- 
trow factor. If instead the Jastrow and orbital param- 
eters are optimized simultaneously a significantly lower 
energy is obtained, e.g. including all 30 virtual orbitals 
gives an energy of -75.8069(5) Ha (see Table I) as opposed 
to -75.7845(5) Ha in Fig. 4. 

To summarize, the Newton and the linear methods 
converge very rapidly when optimizing any kind of pa- 
rameter, though the linear method is more stable for the 
optimization of the determinantal part of the wave func- 
tion. The perturbative method is a good, less expensive 
alternative for the optimization of the orbital parameters 
and, to a lesser extent, for the optimization of the CSF 
parameters, but is very slowly convergent for the Jastrow 
parameters. 

It is clear from Eq. (43) that the perturbative method 
can be viewed as a Newton method with an approximate 
Hessian. The poor behavior of the perturbative method 
for the Jastrow parameters means that this Hessian is 
a bad approximation to the exact Hessian, whose eigen- 
values span more than 10 orders of magnitude for these 
parameters. In fact, any method based on an approxi- 
mate Hessian that is not able to reproduce all these or- 
ders of magnitude, such as the steepest-descent method, 
is bound to converge very slowly. On the other hand, the 
eigenvalues of the Hessian for the CSF and orbital pa- 
rameters span only a couple of orders of magnitude and 
the approximate Hessian of the perturbative method is 
sufficient to allow rapid convergence. 



D. Optimization of all the parameters: 
simultaneous or alternated optimization? 

After having studied the behavior of the energy min- 
imization methods for the optimization of each kind of 
parameter, we now move on to the more practical prob- 
lem of how to optimize all the parameters. 

The most obvious possibility is to optimize simulta- 
neously the Jastrow, CSF and orbital parameters using 
the linear method, the method having the best overall 
efficiency for all these parameters. In practice, we pro- 
ceed as follows. We start an optimization run with a 
short Monte Carlo simulation with a large statistical er- 
ror (e.g., 0.02 Ha for the C2 molecule), and we decrease 
progressively the statistical error at each iteration until 
the energy is converged to 10 -3 Ha for three consecu- 
tive iterations. We choose the optimal parameters to be 
those from the iteration with the smallest value of -EVmc 
plus three times the statistical error of £VmC; which is 
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FIG. 5: Convergence of the VMC total energy -Evmc (left plot) and of the VMC standard deviation of the local energy 
o"vmc (right plot) of the all-electron C2 molecule during the simultaneous optimization of the 24 Jastrow parameters, 49 CSF 
parameters and 64 orbital parameters in a wave function composed of the CAS(8,7) determinantal part multiplied by a Jastrow 
factor, using the linear energy minimization method. The statistical error on the energy is initially of 0.02 Ha and is decreased 
by a factor 2 at each iteration until 0.5 mHa. The insets are enlargements of the last 5 iterations. 



often but not always the last iteration. A typical exam- 
ple of the convergence of the total VMC energy and of 
the standard deviation ovmc is shown in Fig. 5 for the si- 
multaneous optimization of the Jastrow, CSF and orbital 
parameters in a wave function composed of a CAS(8,7) 
determinantal part multiplied by a Jastrow factor. In 
this case, the energy converges in 4 or 5 iterations. The 
standard deviation typically converges a little slower than 
the energy since we are optimizing just the energy here. 
A faster convergence, to a somewhat smaller value of the 
standard deviation, can be achieved by optimizing a lin- 
ear combination of the energy and variance as in Ref. [23] . 

Another possibility is to alternate between the opti- 
mization of the different kinds of parameters until global 
convergence. This has the advantage of allowing one to 
use different optimization methods for the various pa- 
rameters, e.g. optimization of the Jastrow factor and the 
CSF coefficients with the Newton or linear method and 
optimization of the orbitals with the less expensive but 
still very efficient perturbative method. Fig. 6 shows the 
convergence of the total VMC energy and of the standard 
deviation during the alternated optimization of the Jas- 
trow parameters and of the orbital parameters in a wave 
function composed of a single Slater determinant multi- 
plied by a Jastrow factor for the all-electron C2 molecule. 
The convergence of the energy is surprisingly very slow, 
the convergence of the standard deviation is even worse. 
This is an indication of the presence of a strong coupling 
between some Jastrow and orbital parameters. This situ- 
ation is in sharp contrast with the case where a pseudopo- 
tential is used to remove the core electrons. Fig. 7 shows 
the convergence of the total VMC energy during the al- 
ternated optimization of the Jastrow parameters and of 
the orbital parameters in a wave function composed of a 
single Slater determinant multiplied by a Jastrow factor 



for the C2 molecule with a Hartree-Fock pseudopoten- 
tial [59] and an adequate Gaussian one-electron basis set. 
The convergence is very fast, the energy being essentially 
converged within the statistical error in 1 macroitera- 
tion. This favorable behavior has already been observed 
in other systems with pseudopotentials [56] , but we have 
also found pseudopotential systems for which the conver- 
gence is not as fast. 

For the all-electron case, it thus seems that simultane- 
ous optimization of the parameters is much preferable. 
The coupling between the different parameters seems to 
be too strong to allow an efficient alternated optimiza- 
tion. For large systems most of the wave function pa- 
rameters are orbital and CSF parameters for which the 
perturbative method works well. It seems then promising 
to simultaneously optimize all the parameters with the 
Newton or the linear methods, using for the part of the 
Hessian or the Hamiltonian matrices involving the CSF 
and orbital coefficients rough approximations inspired by 
the perturbative method [60]. 



E. Systematic improvement by wave function 
optimization 

Table I reports the total VMC and DMC energies, 
J?vmc and EdmCi an d the VMC standard deviation of 
the local energy <t V mc = \J (-E'l(R) 2 ) — (-E'l(R-)) 2 for the 
different trial wave functions considered. For the single- 
determinant, CAS(8,5), CAS(8,7) and CAS(8,8) wave 
functions, we present the results for three levels of op- 
timization. At the first level, only the Jastrow factor 
is optimized. At the second level, the Jastrow factor 
and the CSF coefficients are optimized together. At the 
third level, the Jastrow factor, the CSF coefficients and 
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FIG. 6: Demonstration of the slow convergence of the VMC total energy £Vmc (left plot) and of the VMC standard deviation 
of the local energy ovmc (right plot) of the all-electron C2 molecule during the alternated optimization of the 24 Jastrow 
parameters and 44 orbital parameters in a wave function composed of a single Slater determinant multiplied by a Jastrow factor. 
The half-integer macroiteration numbers correspond to the optimization of the Jastrow factor and the integer macroiteration 
numbers correspond to the optimization of the orbitals. The statistical error on the energy is always 0.5 mHa. The simultaneous 
optimization of the Jastrow and orbital parameters gives an energy of —75.8069(5) Ha and a standard deviation of 1.1 Ha, 
indicated on the plots by horizontal lines. 



Wave function form 


Parameters optimized in VMC 


-Evmc -Bdmc ovmc 


Jastrow x determinant 


Jastrow (24) 

Jastrow (24) + orbitals (44) 


-75.7648(5) -75.8570(5) 1.4 
-75.8069(5) -75.8682(5) 1.1 


Jastrow x CAS(8,5) 


Jastrow (24) 

Jastrow (24) + CSFs (6) 

Jastrow (24) + CSFs (6)+ orbitals (52) 


-75.8045(3) -75.8750(5) 1.3 
-75.8094(5) -75.8807(5) 1.3 
-75.8374(5) -75.8882(5) 1.0 


Jastrow x CAS(8,7) 


Jastrow (24) 

Jastrow (24) + CSFs (49) 

Jastrow (24) + CSFs (49) + orbitals (64) 


-75.8469(5) -75.8973(5) 1.2 
-75.8546(5) -75.9032(5) 1.2 
-75.8769(5) -75.9092(5) 0.9 


Jastrow x CAS(8,8) 


Jastrow (24) 

Jastrow (24) + CSFs (164) 

Jastrow (24) + CSFs (164) + orbitals (70) 


-75.8462(5) -75.8999(6) 1.1 
-75.8562(5) -75.9050(6) 1.1 
-75.8801(6) -75.9099(5) 0.9 


Jastrow x RAS-SDTQ(8,26) 


Jastrow + CSFs + orbitals (extrapolation) 


-75.9016(5) -75.9191(5) 


Exact 


-75.9265(8) 



TABLE I: Total VMC and DMC energies, -Evmc and J?dmc, and VMC standard deviation of the local energy ctvmc of the C2 
molecule for different trial wave functions and different levels of optimization. The kind and number of optimized parameters 
are indicated. When not optimized in VMC, the CSF and orbital coefficients have been fixed at their RHF values for the single- 
determinant case and at their CAS MCSCF values for the multiconfiguration cases. For the large Jastrow x RAS-SDTQ(8,26) 
wave function, the VMC and DMC values are obtained by an extrapolation procedure (see Sec. VIE and Fig. 8). For the 
energies, the numbers in parentheses are estimates of the statistical error on the last digit. All units are Hartree. 



the orbitals are all optimized together. Going from one 
level to the next one improves the accuracy of the wave 
function but also increases the computational cost of the 
optimization. We note that it is important to reoptimize 
the determinantal (CSF and orbital) parameters, along 
with the Jastrow parameters, rather than keeping them 
fixed at the values obtained from the MCSCF wave func- 
tions. For each wave function, the effect of reoptimizing 
the determinantal part is to lower the VMC energy by 
about 0.03 to 0.04 Ha, and the standard deviation of the 
energy by about 0.2 to 0.3 Ha. More remarkably, even 
though the optimization is performed at the VMC level, 



the DMC energy also goes down by about 0.01 Ha, imply- 
ing that the nodal hypersurface of the trial wave function 
also improves. In addition, one observes a systematic im- 
provement of the VMC and DMC energies when the size 
of the CAS increases, provided that at least the CSF co- 
efficients are reoptimized with the Jastrow factor. 

Including all the 110481 CSFs of the RAS-SDTQ(8,26) 
wave function is too costly in quantum Monte Carlo but 
one can use a series of truncated wave functions obtained 
by retaining only small numbers of CSFs with coefficients 
larger in absolute value than a variable cutoff, and then 
estimate the energy by extrapolation to the limit that 
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FIG. 7: Convergence of the VMC total energy -Evmc of the C2 
molecule with a pseudopotential removing the core electrons 
during the alternated optimization of the 24 Jastrow param- 
eters and 42 orbital parameters in a wave function composed 
of a single Slater determinant multiplied by a Jastrow factor. 
The half-integer macroiteration numbers correspond to the 
optimization of the Jastrow factor and the integer macroiter- 
ation numbers correspond to the optimization of the orbitals. 
The statistical error on the energy is always 0.5 mHa. The 
simultaneous optimization of the Jastrow and orbital param- 
eters gives an energy of —11.0030(5) Ha, indicated on the plot 
by horizontal lines. 



all the CSFs are kept. Fig. 8 shows the VMC and DMC 
energies obtained with these truncated, fully reoptimized 
multi-determinantal wave functions with respect to the 
sum of the squares of the MCSCF CSF coefficients re- 
tained, V;^ SF (c,f CSCF ) 2 . Since the RAS-SDTQ(8,26) 
wave function is normalized, the latter quantity is equal 
to 1 in the limit where all the CSFs are kept in the wave 
function. Experience shows that the energies are well 
extrapolated by quadratic fits. The extrapolated DMC 
energy is —75.9191(5) which amounts for 98.6% of the 
correlation energy (using the HF energy of -75.40620 Ha 
calculated in Ref. 40). 

On the other hand, to calculate accurate well depths 
(dissociation energy + zero-point energy) it is often suf- 
ficient to rely on some partial cancellation of error be- 
tween the atom and the molecule by employing atomic 
and molecular wave functions that are consistent with 
each other. For example, using the DMC energy of 
the C2 molecule given by the Jastrow-Slater full-valence 
CAS(8,8) wave function and the DMC energy of the C 
atom given by the consistent Jastrow-Slater full-valence 
CAS (4,4) wave function with the same one-electron basis 
leads to a well depth of 6.46(1) eV, in perfect agreement 
within the uncertainty with the exact, nonrelativistic well 
depth estimated at 6.44(2) eV [41, 61]. In contrast, the 
well depth calculated from MCSCF with the molecular 
CAS(8,8) and atomic CAS(4,4) wave functions (without 
Jastrow factor) is 5.62 eV, in poor agreement with the 
exact value. 




-75.92 - 
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FIG. 8: VMC and DMC energies obtained with the truncated, 
fully reoptimized Jastrow-Slater RAS-SDTQ(8,26) wave func- 
tions with respect to the sum of the squares of the MCSCF 
CSF coefficients retained, "£i=i F (cf CSCF ) 2 . The latter quan- 
tity is equal to 1 in the limit where all the CSFs of the RAS- 
SDTQ(8,26) calculation are kept in the wave function which 
is extrapolated by quadratic fits. 



VII. CONCLUSIONS 

We have studied three wave function optimization 
methods based on energy minimization in a VMC con- 
text: the Newton, linear and perturbative methods. 
These general methods have been applied here to the op- 
timization of wave functions consisting of a multiconfig- 
uration expansion multiplied by a Jastrow factor for the 
all-electron C2 molecule. The Newton and linear meth- 
ods are both very efficient for the optimization of the 
Jastrow, CSF and orbital parameters, the linear method 
being generally more stable. The less computationally 
expensive perturbative method is efficient only for the 
CSF and orbital parameters. We have used the linear 
method to simultaneously optimize the Jastrow, CSF and 
orbital parameters, a much more efficient procedure than 
alternating between optimizing the different kinds of pa- 
rameters. The linear method is capable of yielding not 
only ground state energies but excited state energies as 
well [11]. 

Although the optimization is performed at the VMC 
level, we have observed for the C2 molecule studied here, 
as well as for other systems not discussed in the present 
paper, that as more parameters are optimized the DMC 
energies decrease monotonically, implying that the nodal 
hypersurface also improves monotonically. In fact, a se- 
quence of trial wave functions consisting of multiconfigu- 
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ration expansions of increasing sizes multiplied by a Jas- 
trow factor, with all the Jastrow, CSF and orbital pa- 
rameters optimized together allows one to systematically 
reduce the fixed-node error of DMC calculations for the 
systems studied. 

Future directions for this work include optimization of 
the exponents of the one-electron basis functions (either 
Slater or Gaussian functions), direct optimization of the 
DMC energy and optimization of the geometry. 
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